SMALL-WORLD MCMC AND CONVERGENCE TO
MULTI-MODAL DISTRIBUTIONS: FROM SLOW
MIXING TO FAST MIXING

By Yongtao Guan and Stephen M. Krone 

University of Chicago and University of Idaho 

We compare convergence rates of Metropolis-Hastings chains to 
multi-modal target distributions when the proposal distributions can 
be of "local" and "small world" type. In particular, we show that by 
adding occasional long-range jumps to a given local proposal distri- 
bution, one can turn a chain that is "slowly mixing" (in the com- 
plexity of the problem) into a chain that is "rapidly mixing." To do 
this, we obtain spectral gap estimates via a new state decomposition 
theorem and apply an isoperimetric inequality for log-concave prob- 
ability measures. We discuss potential applicability of our result to 
Metropolis-coupled Markov chain Monte Carlo schemes. 



1. Introduction and main result. Many applications of Markov chain 
Monte Carlo (MCMC) involve very large and/or complex state spaces, and 
convergence rates are an important issue. A major problem in MCMC is thus 
to find sampling schemes whose mixing times do not grow too rapidly as the 
size or complexity of the space is increased. Guan et al. [8] used computer 
simulations to show that such problems can be handled simply and efficiently 
by using an idea from "small-world networks" [27] to make a slight change 
in a given proposal scheme. This change amounts to augmenting a typical 
local proposal distribution with low probability long-distance jumps that 
effectively contract the space and lead to much faster convergence to multi- 
modal target distributions. In this paper we make rigorous comparisons of 
the convergence rates of these two types of chains on $\mathbb{R}^n$. We see this as
a first step in handling other complex state spaces, with the connection
between $\mathbb{R}^n$ and such spaces coming through possible embedding theorems.

Let $\pi$ be a multi-modal probability measure on a convex set $\Omega \subset \mathbb{R}^n$.
We wish to compare convergence rates to this measure by two different 
Metropolis-Hastings chains that are characterized by their proposal distri- 
butions: "local" and "small world." From now on, we refer to these two types 
of Markov chains as "local chains" and "small-world chains," respectively. 
Intuitively, a local proposal distribution is one that has thin tails, so that 
the mean distance of a proposed move away from the current state is small 
compared to the distances between modes; by a small-world proposal, we 
mean a mixture of a local proposal and a heavy-tailed proposal, so that 
there are both small and large proposed moves away from the current state. 

In a multi-modal space a local chain will equilibrate rapidly within a 
mode, but takes a long time to move from one mode to another. Hence, the 
entire chain converges slowly to the target distribution. However, a small 
fraction of heavy-tailed proposals enables a small-world chain to move from 
mode to mode much more quickly. While this reduces the efficiency of equi- 
librating within a mode, it is a small price to pay and easily outperforms 
purely local proposals. This is the spirit of our main results. We derive 
bounds on the spectral gaps for such local and small-world chains and, hence,
show how a small fraction of heavy-tailed proposals can turn a slowly mixing 
chain into a rapidly mixing chain. 

Throughout this paper, we assume the state space $\Omega$ is equipped with
two measures: a reference measure, taken to be the Lebesgue measure $\mu$,
and a Borel probability measure $\pi$ which serves as the target distribution.
Suppose $\pi$ is absolutely continuous with respect to $\mu$ so that it admits a
density $\pi(x)$:

$$\pi(B) = \int_B \pi(x) \, \mu(dx).$$

The most widely used Markov chain Monte Carlo method is the Metropolis- 
Hastings algorithm [9, 22], which we now describe briefly. 

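1.1. Metropolis-Hastings algorithm. A transition probability kernel $P(x, dy)$
corresponds to a Metropolis-Hastings Markov chain on $\Omega$ if it is of the form

$$P(x, dy) = \alpha(x, y) k(x, y) \, \mu(dy) + r(x) \, \delta_x(dy),$$

where $k(x, y)$ is the proposal distribution and we say $k(x, y)$ induces $P(x, dy)$,

$$\alpha(x, y) = \min\left( \frac{\pi(y) k(y, x)}{\pi(x) k(x, y)}, 1 \right)$$

is the acceptance probability of a proposed move, $\delta_x$ is the unit point mass
at $x$, and

$$r(x) = 1 - \int_\Omega \alpha(x, y) k(x, y) \, \mu(dy)$$
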
is the probability that the proposed move from $x$ is rejected. It is easy to
check that the transition kernel $P(x, dy)$ satisfies the detailed balance
equation $\pi(dx) P(x, dy) = \pi(dy) P(y, dx)$ as measures on $\Omega \times \Omega$, so that $P(x, dy)$
is reversible with respect to $\pi$ and, hence, has $\pi$ as an invariant measure.
For simplicity, we consider only (spherically) symmetric proposal
distributions, $k(x, y) = k(|x - y|)$, in which case the acceptance probability simplifies
to $\alpha(x, y) = \min(\pi(y)/\pi(x), 1)$. [In typical cases for which the proposal chain is a
random walk and $\{x : \pi(x) > 0\}$ is path connected, the Metropolis-Hastings
chain will be irreducible and, hence, $\pi$ is the unique invariant measure.]
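
The kernel described in this subsection can be sketched in a few lines of Python. The sketch below assumes a symmetric proposal, so the acceptance probability reduces to min(pi(y)/pi(x), 1). The function names `ball_proposal` and `mh_chain`, the standard normal test target, and the use of NumPy are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def ball_proposal(x, delta, rng):
    """Symmetric local proposal: uniform on the ball of radius delta around x."""
    n = x.shape[0]
    direction = rng.standard_normal(n)
    direction /= np.linalg.norm(direction)
    radius = delta * rng.uniform() ** (1.0 / n)
    return x + radius * direction

def mh_chain(log_density, propose, x0, n_steps, rng):
    """Run a Metropolis-Hastings chain with a symmetric proposal.

    log_density: log of the (possibly unnormalized) target density pi(x)
    propose:     function (x, rng) -> y drawing from a symmetric k(x, y)
    """
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.shape[0]))
    for t in range(n_steps):
        y = propose(x, rng)
        # Accept with probability alpha(x, y) = min(pi(y) / pi(x), 1).
        if np.log(rng.uniform()) < log_density(y) - log_density(x):
            x = y
        # On rejection the chain stays at x (the r(x) * delta_x term of the kernel).
        samples[t] = x
    return samples

# Hypothetical usage: a 1-D standard normal target with a delta = 2 ball walk.
rng = np.random.default_rng(0)
samples = mh_chain(lambda x: -0.5 * float(x @ x),
                   lambda x, r: ball_proposal(x, 2.0, r),
                   np.zeros(1), 20000, rng)
```

Working with log-densities avoids overflow and means the normalization constant of the target is never needed, since only the ratio pi(y)/pi(x) enters the acceptance step.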

1.2. Geometric ergodicity and spectral gap. Let $L^2(\pi)$ denote the space
of (Borel) measurable, complex functions on $\Omega$ satisfying

$$\int_\Omega |f(x)|^2 \, \pi(dx) < \infty.$$

This is a Hilbert space with inner product $\langle f, g \rangle = \int_\Omega f(x) \overline{g(x)} \, \pi(dx)$ and
norm $\|f\| = \langle f, f \rangle^{1/2}$. The Metropolis-Hastings kernel $P(x, dy)$ induces a
contraction operator $P$ on $L^2(\pi)$ given by $Pf(x) = \int_\Omega f(y) P(x, dy)$. We say
the operator $P$ is induced by a proposal distribution $k(x, y)$ if the same is
true of its transition kernel. $P(x, dy)$ being reversible with respect to $\pi$ is
equivalent to the operator $P$ being self-adjoint, that is,

$$\langle Pf, g \rangle = \langle f, Pg \rangle, \qquad f, g \in L^2(\pi).$$

It is well known that the spectrum of $P$ is a subset of $[-1, 1]$. [$P$ being
self-adjoint implies its spectrum is real, and $P(x, dy)$ being a transition
probability kernel determines the range.]

A chain is $L^2(\pi)$-geometrically ergodic if there exists $\gamma < 1$ such that

(2) $$\|\mu_0 P^n - \pi\| \le \gamma^n \|\mu_0 - \pi\|$$

for any nonnegative integer $n$ and any probability measure $\mu_0 \in L^2(\pi)$
(i.e., $\mu_0 \ll \pi$ with $\int |d\mu_0/d\pi|^2 \, d\pi < \infty$). Roberts and Tweedie [26] have shown that
convergence in $L^2$ implies convergence in "total variation" norm

$$\|\mu_1 - \mu_2\|_{\mathrm{tv}} = \sup_A |\mu_1(A) - \mu_2(A)| = \frac{1}{2} \int |f_1(x) - f_2(x)| \, dx,$$

where $f_i(x) = d\mu_i/dx$.

Let $L_0^2(\pi)$ denote the orthogonal complement of the constant function 1
in $L^2(\pi)$:

$$L_0^2(\pi) = \left\{ f \in L^2(\pi) : \langle f, 1 \rangle = \int_\Omega f(x) \, \pi(dx) = 0 \right\}.$$

Clearly, as a subspace of $L^2(\pi)$, $L_0^2(\pi)$ is also a Hilbert space. Denote by
$P_0$ the restriction of $P$ to $L_0^2(\pi)$. Chan and Geyer [5] proved that, for a
geometrically ergodic chain, $P_0$ has no point spectrum (i.e., eigenvalues) of
value $\pm 1$. In addition, it has been shown [25, 26] that, for reversible Markov
chains, geometric ergodicity is equivalent to the condition

(3) $$\|P_0\| = \sup_{f \in L_0^2(\pi), \ \|f\| \le 1} \|P_0 f\| < 1,$$

and any $\gamma \in [\|P_0\|, 1)$ satisfies equation (2). The spectral gap of the chain $P$
is defined by

$$\mathrm{Gap}(P) = 1 - \|P_0\|.$$

Thus, the spectral gap provides a measure of the speed of convergence of 
a Markov chain to its stationary measure. Two of the main tools for study- 
ing spectral gaps in the setting of MCMC are conductance and Cheeger's 
inequality, to which we now turn. 



1.3. Conductance and Cheeger's inequality. Let $P$ be a Markov transition
kernel that is reversible with respect to $\pi$. For $A \subset \Omega$ with $\pi(A) > 0$,
define

(4) $$h_P(A) = \frac{1}{\pi(A)} \int_A P(x, A^c) \, \pi(dx).$$

The quantity $h_P(A)$ can be thought of as the (probability) flow out of the
set $A$ in one step when the Markov chain is at stationarity. Notice that
$\pi(dx)/\pi(A)$ is the conditional stationary measure on the set $A$.
The conductance of the chain is defined by

(5) $$h_P = \inf_{0 < \pi(A) \le 1/2} h_P(A).$$

Note that $0 \le h_P \le 1$. Intuitively, small $h_P$ implies that the chain can
become stuck for a long time in some set whose measure is at most 1/2, making
it difficult for the chain to sample the rest of the distribution. As a result,
such a chain converges slowly to the stationary measure. On the other hand,
a large $h_P$ implies that the chain travels around swiftly and, hence, samples
different parts of the distribution efficiently. As a result, such a chain
converges rapidly. Lawler and Sokal [14] have quantified this as a generalization
of Cheeger's inequality.



Theorem 1.1 (Cheeger's inequality). Let $P$ be a reversible Markov
transition kernel with invariant measure $\pi$. Then

(6) $$\frac{h_P^2}{2} \le \mathrm{Gap}(P) \le 2 h_P.$$

Next, suppose that a proposal distribution $k(x, y)$ is a mixture of two
proposal distributions $k_1(x, y)$ and $k_2(x, y)$. That is, $k(x, y) = (1 - s) k_1(x, y) +
s k_2(x, y)$, for some $0 < s < 1$. Suppose operators $P$, $P_1$ and $P_2$ are induced
by $k(x, y)$, $k_1(x, y)$ and $k_2(x, y)$, respectively. Clearly,

(7) $$P = (1 - s) P_1 + s P_2$$

and, for any measurable set $A$, $h_P(A) = (1 - s) h_{P_1}(A) + s h_{P_2}(A)$. As an
immediate consequence, we have the following lemma showing that
conductance acts like a concave function on transition kernels and the spectral gap
can be bounded from below by one of the components.

Lemma 1.2. Suppose a reversible chain has a mixture kernel defined by
(7). Then the conductance of the chain satisfies $h_P \ge (1 - s) h_{P_1} + s h_{P_2}$. In
addition,

(8) $$\mathrm{Gap}(P) \ge \frac{1}{2} (1 - s)^2 h_{P_1}^2.$$

Proof. From (5),

$$h_P = \inf_{0 < \pi(A) \le 1/2} \big( (1 - s) h_{P_1}(A) + s h_{P_2}(A) \big)$$
$$\ge (1 - s) \inf_{0 < \pi(A) \le 1/2} h_{P_1}(A) + s \inf_{0 < \pi(B) \le 1/2} h_{P_2}(B)$$
$$= (1 - s) h_{P_1} + s h_{P_2} \ge (1 - s) h_{P_1}.$$

Combine this with Cheeger's inequality (6) to get (8). □
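
On a small finite state space, conductance, Cheeger's inequality and Lemma 1.2 can all be checked by brute force. The sketch below is only an illustration under stated assumptions: the 8-state bimodal target, the lazy proposals (holding probability 1/2, chosen so that all eigenvalues are nonnegative), and the helper names `metropolis_matrix`, `conductance` and `spectral_gap` are hypothetical and not taken from the paper.

```python
import itertools
import numpy as np

def metropolis_matrix(pi, K):
    """Metropolis-Hastings transition matrix for target pi and symmetric proposal K."""
    P = K * np.minimum(1.0, pi[None, :] / pi[:, None])  # accept w.p. min(pi(y)/pi(x), 1)
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))            # remaining mass stays at x
    return P

def conductance(P, pi):
    """h_P: minimum over sets A with 0 < pi(A) <= 1/2 of (flow out of A) / pi(A)."""
    n, best = len(pi), np.inf
    for r in range(1, n):
        for A in itertools.combinations(range(n), r):
            A = list(A)
            mass = pi[A].sum()
            if mass <= 0.5 + 1e-12:
                Ac = [i for i in range(n) if i not in A]
                flow = (pi[A][:, None] * P[np.ix_(A, Ac)]).sum()
                best = min(best, flow / mass)
    return best

def spectral_gap(P, pi):
    """Gap(P) = 1 - ||P_0|| for a chain reversible with respect to pi."""
    d = np.sqrt(pi)
    S = d[:, None] * P / d[None, :]                     # symmetric if P is reversible
    eig = np.sort(np.linalg.eigvalsh((S + S.T) / 2.0))
    return 1.0 - max(abs(eig[0]), abs(eig[-2]))         # drop the top eigenvalue 1

# Assumed toy setting: two modes on a cycle of 8 states.
n, s = 8, 0.2
pi = np.array([1.0, 4.0, 4.0, 1.0, 1.0, 4.0, 4.0, 1.0])
pi /= pi.sum()
I = np.eye(n)
shift = np.roll(I, 1, axis=1)
K1 = 0.5 * I + 0.25 * (shift + shift.T)                 # lazy nearest-neighbour proposal
K2 = 0.5 * I + 0.5 * np.full((n, n), 1.0 / n)           # lazy uniform (long-range) proposal

P1, P2 = metropolis_matrix(pi, K1), metropolis_matrix(pi, K2)
P = (1 - s) * P1 + s * P2                               # mixture kernel, as in (7)

h, h1, h2 = conductance(P, pi), conductance(P1, pi), conductance(P2, pi)
gap = spectral_gap(P, pi)
print("Cheeger:", h**2 / 2, "<=", gap, "<=", 2 * h)
print("Lemma 1.2:", h, ">=", (1 - s) * h1 + s * h2)
```

The subset enumeration is exponential in the number of states, so this check is only practical for very small chains.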



1.4. Definitions and main result. Let $|\cdot|$ be a norm on $\Omega \subset \mathbb{R}^n$ and $B_r(x)$
the $n$-dimensional ball centered at $x$ with radius $r$. Denote by $\partial B_r(x)$ the
surface of the ball, and write $\pi^+(\partial A)$ for the surface measure (relative to $\pi$)
of a set $A$ in the sense that

$$\pi^+(\partial A) = \liminf_{\varepsilon \to 0^+} \frac{\pi(A^\varepsilon) - \pi(A)}{\varepsilon},$$

where $A^\varepsilon = \{x \in \Omega : \exists \, a \in A, \ |x - a| < \varepsilon\}$ is the $\varepsilon$-neighborhood of $A$,
consisting of the union of $A$ and its "$\varepsilon$-boundary" $A^\varepsilon \setminus A$.

We say the measure $\pi$ is log-concave if it has a density with respect to
$\mu$ of the form $\pi(x) = \exp(-V(x))$, where $V : \Omega \to [-\infty, +\infty]$ can be an
arbitrary convex function. Examples of log-concave distributions include
uniform, exponential, normal and gamma distributions. For technical reasons,
we restrict our attention to "smooth" log-concave functions (but see
discussion at the end of Section 3). We say a log-concave function $\exp(-V(x))$
is $\alpha$-smooth if for any $x, y \in \Omega$ we have $|V(x) - V(y)| \le \alpha |x - y|$. By Borell's
theorem [4], the tail of $\pi(x)$ is exponentially decreasing, that is, there is a
number $\nu_\pi > 0$, such that $\pi^+(\partial B_r(\beta)) \le c \exp(-\nu_\pi r)$, for some constant $c$.
(This is also easy to check directly for most examples.) We will refer to $\nu_\pi$
as a decay exponent for $\pi$. Define the first absolute centered moment of $\pi$
as $M_\pi = \int_\Omega |x - \beta| \, \pi(dx)$, where $\beta = \int_\Omega x \, \pi(dx)$ is the barycenter of $\pi$.

Next, we characterize the multi-modal distributions that will serve as our
target distributions. Let $\Omega = A_1 \cup \cdots \cup A_m$ be a partition of the state space
into disjoint convex subsets. Suppose concentrated on each $A_i$ we have
a single $\alpha$-smooth log-concave probability measure $\pi_i$ with decay exponent
$\nu_{\pi_i}$ and barycenter $\beta_i \in A_i$. Let $d_{ij} = |\beta_i - \beta_j|$, $i \ne j$, denote the pairwise
distances between barycenters. The target distribution of interest is then
defined as a mixture of these log-concave densities:

(9) $$\pi(x) = c \sum_{i=1}^{m} \pi_i(x) 1_{A_i}(x),$$

where $c$ is a normalization constant and $1_{A_i}$ is the indicator function of $A_i$.
When the modes have different smoothness parameters, we take $\alpha$ to be the
largest such.

We will refer to features of the above probability measure $\pi$ that present
barriers to mixing in the local Metropolis-Hastings chain as the "complexity
of the target distribution." These include [if $< \infty$], $d_{ij}$ and $\nu_{\pi_i}$. In
particular, we say a given chain is slowly mixing in the complexity of $\pi$ if the
spectral gap of the chain is an exponentially decreasing function of at least
one of these quantities. We say a chain is rapidly mixing in the complexity
of $\pi$ if the spectral gap is a polynomially decreasing function of all of these
quantities.

To make our calculations concrete, we will always use for our symmetric
local proposal distribution $k(x, y)$ a uniform distribution on an $n$-dimensional
ball with radius $\delta$. Such a proposal distribution captures the essence of
"local proposals" and is easier to handle than other light-tailed proposals. We
will sometimes refer to such a local proposal scheme as a "$\delta$-ball walk."

Let $h(x, y)$ be a heavy-tailed distribution, that is, one for which the tails
decrease polynomially, instead of exponentially, on $\Omega$. (For concreteness in
exposition, we shall restrict ourselves to Cauchy distributions when $\Omega$ is
unbounded, and uniform distributions when $\Omega$ is compact.) A small-world
proposal distribution $g(x, y)$ is a mixture of local and heavy-tailed
distributions:

(10) $$g(x, y) = (1 - s) k(x, y) + s h(x, y)$$

for some $s \in (0, 1)$.
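
A draw from a small-world proposal of the form (10) can be sketched as follows: with probability 1 - s take a uniform step in the delta-ball, and with probability s take a Cauchy step. This is only an illustration; the function name `small_world_proposal`, the Cauchy scale parameter `b`, and the particular parameter values in the usage lines are assumptions made for the example.

```python
import numpy as np

def small_world_proposal(x, delta, b, s, rng):
    """Draw y from g(x, .) = (1 - s) * k(x, .) + s * h(x, .).

    k: uniform on the ball of radius delta around x (local move)
    h: n-dimensional Cauchy centred at x with scale b (heavy-tailed move)
    """
    n = x.shape[0]
    if rng.uniform() < s:
        # Multivariate Cauchy = multivariate t with one degree of freedom:
        # a standard normal vector divided by |independent standard normal|.
        step = b * rng.standard_normal(n) / abs(rng.standard_normal())
    else:
        direction = rng.standard_normal(n)
        direction /= np.linalg.norm(direction)
        step = delta * rng.uniform() ** (1.0 / n) * direction
    return x + step

# Hypothetical usage in two dimensions: most steps are short, a fraction of
# roughly s are long-range jumps.
rng = np.random.default_rng(1)
x = np.zeros(2)
steps = np.array([np.linalg.norm(small_world_proposal(x, 0.5, 10.0, 0.2, rng) - x)
                  for _ in range(5000)])
frac_long = np.mean(steps > 0.5)
```

Both components depend only on the displacement y - x and are spherically symmetric, so the mixture is a symmetric proposal and the simplified acceptance probability min(pi(y)/pi(x), 1) still applies.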

We are now ready to state our main result: 
Theorem 1.3. Let $\pi$ be the multi-modal probability measure defined by
(9) with $\alpha$-smooth log-concave modes. Let $k(x, y)$ be the local proposal
distribution and let $g(x, y)$ be defined by (10), where $h(x, y)$ is a heavy-tailed
proposal. Then the local Metropolis-Hastings chain induced by $k(x, y)$ is "slowly
mixing," and the small-world chain induced by $g(x, y)$ is "rapidly mixing"
in the complexity of $\pi$.

Note that the local component of the small-world chain is the same as in 
the local chain. 

The rest of the paper is organized as follows. In the next section we prove a 
new version of the state decomposition theorem of Madras and Randall [19]. 
This will play an important role in proving our main theorem. On each log- 
concave piece, an upper bound on conductance is easy to obtain. However, 
the lower bound requires some extra work. Thus, we devote Section 3 to 
finding a lower bound through an isoperimetric inequality for log-concave 
probability measures. The proof of the main theorem is given in Section 4. 
In Section 5 we discuss possible applications of our result to convergence 
rates in Metropolis-coupled Markov chain Monte Carlo. 

2. State decomposition theorem. In this section we state and prove a
new version of the state decomposition theorem of [19]. The setup of the
new theorem is the same as that of their paper, but we repeat it here for
convenience. Recall that $\{A_1, \ldots, A_m\}$ is a partition of $\Omega$. We describe the
"pieces" of a Metropolis-Hastings chain $P$ by defining, for each $i = 1, \ldots, m$,
a new Markov chain on $A_i$ that rejects any transitions of $P$ out of $A_i$. The
transition kernel $P_{A_i}$ of the new chain is given by

(11) $$P_{A_i}(x, B) = P(x, B) + 1_B(x) P(x, A_i^c) \quad \text{for } x \in A_i, \ B \subset A_i.$$

It is easy to see that $P_{A_i}$ is reversible on the state space $A_i$ with respect to
the measure $\pi_i$, which, by definition, is the restriction of $\pi$ to the set $A_i$.

The movement of the original chain among the "pieces" can be modeled
by a "component" Markov chain with state space $\{1, \ldots, m\}$ and transition
probabilities:

(12) $$P_H(i, j) = \frac{1}{2 \pi(A_i)} \int_{A_i} P(x, A_j) \, \pi(dx) \quad \text{for } i \ne j,$$

and $P_H(i, i) = 1 - \sum_{j \ne i} P_H(i, j)$. This definition is quite similar to the
definition of the quantity $h_P(A)$, except for the 2 in the denominator. The reason
for this factor will become clear as we progress.

Our theorem is more or less a direct application of the following lemma,
which is due to Caracciolo, Pelissetto and Sokal, and was recorded, together
with its proof, in [19] as Theorem A.1.
Lemma 2.1 (Caracciolo, Pelissetto and Sokal). In the setting stated at
the beginning of this section assume that $P(x, dy)$ and $Q(x, dy)$ are
transition kernels that are reversible with respect to $\pi$. Assume further that $Q$ is
nonnegative definite and let $Q^{1/2}$ denote its nonnegative square root. Then

(13) $$\mathrm{Gap}(Q^{1/2} P Q^{1/2}) \ge \mathrm{Gap}(\bar{Q}) \left( \min_{i = 1, \ldots, m} \mathrm{Gap}(P_{A_i}) \right),$$

where

$$\bar{Q}(i, j) = \frac{1}{\pi(A_i)} \int_{A_i} Q(x, A_j) \, \pi(dx) \quad \text{for } i \ne j,$$

and $\bar{Q}(i, i) = 1 - \sum_{j \ne i} \bar{Q}(i, j)$.

Theorem 2.2 (State decomposition theorem). In the preceding
framework, as given by equations (11) and (12), we have

(14) $$\mathrm{Gap}(P) \ge \frac{1}{2} \mathrm{Gap}(P_H) \left( \min_{i = 1, \ldots, m} \mathrm{Gap}(P_{A_i}) \right).$$

Remark 1. The theorem says the spectral gap for the whole Metropolis-
Hastings chain can be bounded below by taking into account the mixing
speed within each mode and the mixing speed between different modes.

Proof of Theorem 2.2. Let $Q = \frac{1}{2}(I + P)$, where $I$ is the identity
kernel. Reversibility of $Q$ with respect to $\pi$ follows from the same property
for $P$. To see that $Q$ is nonnegative definite (and, hence, can be used in
Lemma 2.1), note first that since $P$ is a self-adjoint probability operator, its
spectrum is a subset of $[-1, 1]$ and, hence, $\|P\| \le 1$. Thus,

$$\langle Qf, f \rangle = \langle \tfrac{1}{2}(I + P) f, f \rangle = \tfrac{1}{2} \big( \langle f, f \rangle + \langle Pf, f \rangle \big) \ge \tfrac{1}{2} (1 - \|P\|) \|f\|^2 \ge 0.$$

Since $Q = \frac{1}{2}(I + P)$, and $Q^{1/2}$ always commutes with $Q$, we have that
$Q^{1/2}$ and $P$ commute. It follows that

$$Q^{1/2} P Q^{1/2} = QP.$$

Furthermore, setting $\gamma = \|P_0\|$, we have $\mathrm{Gap}(P) = 1 - \gamma$ and, as a simple
consequence of the spectral mapping theorem, $\mathrm{Gap}(QP) = 1 - (1/2)\gamma(1 + \gamma)$.
Thus, $2\,\mathrm{Gap}(P) - \mathrm{Gap}(QP) = 2(1 - \gamma) - (1 - (1/2)\gamma(1 + \gamma)) = (1 - \gamma)(1 - \gamma/2) \ge 0$,
and hence,

(15) $$\mathrm{Gap}(P) \ge \tfrac{1}{2} \mathrm{Gap}(QP) = \tfrac{1}{2} \mathrm{Gap}(Q^{1/2} P Q^{1/2}).$$

Following the definition in Lemma 2.1, we have

(16) $$\bar{Q}(i, j) = \frac{\int_{A_i} Q(x, A_j) \, \pi(dx)}{\pi(A_i)} = \frac{\int_{A_i} \big( I(x, A_j) + P(x, A_j) \big) \, \pi(dx)}{2 \pi(A_i)} = \frac{\int_{A_i} P(x, A_j) \, \pi(dx)}{2 \pi(A_i)},$$

which is just $P_H(i, j)$.

Combine equations (12), (13) and (15) to finish the proof. □
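
The ingredients of the state decomposition theorem (restricted chains as in (11), the component chain as in (12), and the bound (14)) can be computed explicitly for a finite chain. The following sketch is a numerical illustration only: the 20-state two-mode target, the lazy proposals (used so that all spectra are nonnegative), the value s = 0.2, and all helper names are assumptions for the example.

```python
import numpy as np

def metropolis_matrix(pi, K):
    """Metropolis-Hastings transition matrix for target pi and symmetric proposal K."""
    P = K * np.minimum(1.0, pi[None, :] / pi[:, None])
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    return P

def spectral_gap(P, pi):
    """Gap(P) = 1 - ||P_0|| for a chain reversible with respect to pi."""
    d = np.sqrt(pi)
    S = d[:, None] * P / d[None, :]
    eig = np.sort(np.linalg.eigvalsh((S + S.T) / 2.0))
    return 1.0 - max(abs(eig[0]), abs(eig[-2]))

def restricted_chain(P, A):
    """P_A as in (11): moves that would leave A are rejected."""
    PA = P[np.ix_(A, A)].copy()
    PA[np.diag_indices(len(A))] += 1.0 - PA.sum(axis=1)
    return PA

def component_chain(P, pi, pieces):
    """P_H as in (12): flow between pieces, with the factor 2 in the denominator."""
    m = len(pieces)
    PH = np.zeros((m, m))
    for i, Ai in enumerate(pieces):
        for j, Aj in enumerate(pieces):
            if i != j:
                flow = (pi[Ai][:, None] * P[np.ix_(Ai, Aj)]).sum()
                PH[i, j] = flow / (2.0 * pi[Ai].sum())
        PH[i, i] = 1.0 - PH[i].sum()
    return PH

def decomposition_bound(P, pi, pieces):
    """Right-hand side of (14)."""
    weights = np.array([pi[A].sum() for A in pieces])
    gap_H = spectral_gap(component_chain(P, pi, pieces), weights)
    gap_pieces = min(spectral_gap(restricted_chain(P, A), pi[A]) for A in pieces)
    return 0.5 * gap_H * gap_pieces

# Assumed toy setting: two exponential modes on a cycle of 20 states.
n, nu, s = 20, 1.5, 0.2
x = np.arange(n)
pi = np.exp(-nu * np.minimum(abs(x - 5), abs(x - 15)))
pi /= pi.sum()
pieces = [list(range(0, 10)), list(range(10, 20))]

I = np.eye(n)
shift = np.roll(I, 1, axis=1)
K_local = 0.5 * I + 0.25 * (shift + shift.T)          # lazy nearest-neighbour walk
K_far = 0.5 * I + 0.5 * np.full((n, n), 1.0 / n)      # lazy uniform long-range jumps

P_local = metropolis_matrix(pi, K_local)
P_sw = metropolis_matrix(pi, (1 - s) * K_local + s * K_far)

gap_local, gap_sw = spectral_gap(P_local, pi), spectral_gap(P_sw, pi)
print("local:      ", gap_local, ">=", decomposition_bound(P_local, pi, pieces))
print("small world:", gap_sw, ">=", decomposition_bound(P_sw, pi, pieces))
```

In this toy setting the gap of the local chain is limited by the tiny flow between the two pieces, while the small-world chain's component chain mixes quickly, which is the mechanism the theorem isolates.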

The same result has been obtained in [21]. However, their proof was not
applicable in the general situation for which $P$ is not nonnegative definite.

There is, of course, a resemblance between our state decomposition theo- 
rem and that of Madras and Randall [19]. We note that, first, our conclusion 
appears to be a bit stronger than theirs in that our result does not depend on 
the number of overlapping "pieces"; second and more important, in the orig- 
inal theorem the connection between different "pieces" of the state space is 
made via overlapping of the different "pieces." Jarner and Yuen [10] have ap- 
plied the original theorem to estimate the convergence rates of 1-dimensional 
local chains. Unfortunately, the original theorem is not readily applicable to 
small-world chains because such chains can move from one region to another 
even when the two regions are not overlapping. On the other hand, in our 
theorem the connection between different "pieces" is made via the
"probability flow" from one region to another. We emphasize that having a chain
that jumps from one region to another without visiting the valleys in
between is the key to sampling a multi-modal space efficiently. This is discussed
in [8]. In particular, the combination of the Hastings ratio and small-world
proposals results in most of the accepted long-range jumps being directly 
from mode to mode, and not from modes to "valleys." 

3. Lower bound for conductance. To apply the state decomposition the- 
orem to a multi-modal probability measure defined by (9), we need a lower 
bound on the conductance (hence, spectral gap) for each log-concave piece 
of the distribution. For this, we use an isoperimetric inequality. 

The idea of using an isoperimetric inequality for log-concave probabil- 
ity measures to obtain a lower bound on the conductance of local chains is 
rather straightforward and has been used by many authors, including Ap- 
plegate and Kannan [1], Kannan and Li [11] and Lovasz and Vempala [17]. 
Isoperimetric inequalities for log-concave probability measures have been 
studied by Bobkov [3] and Kannan, Lovasz and Simonovits [12]. As noted 
in [3], although the result presented in [12] was for a uniform measure on a 
convex set, their method, in fact, extends naturally to general log-concave 
probability measures. The isoperimetric inequality in [12] was studied using 
a "localization lemma" developed by [16] which essentially reduces integral 
inequalities in an n-dimensional space to integral inequalities in a single 
variable. The original form of the result, applied to uniform measures, is the 
following, recorded as Theorem 5.2 in [12]. 

Theorem 3.1 (Kannan, Lovasz and Simonovits). Let $K$ be a convex set
and $K = K_1 \cup K_2 \cup B$ a partition of $K$ into three measurable sets such that
the distance between $K_1$ and $K_2$ is $d(K_1, K_2) > 0$. Let $b = \frac{1}{\mathrm{vol}(K)} \int_K x \, dx$ be
the barycenter of $K$ and $M_1(K) = \int_K |x - b| \, dx$. Then

$$\mathrm{vol}(B) \ge \frac{\ln 2}{M_1(K)} \, d(K_1, K_2) \, \mathrm{vol}(K_1) \, \mathrm{vol}(K_2).$$

The following is the log-concave version of the above isoperimetric
inequality. See also [18], Theorem 2.4.

Theorem 3.2. Suppose $\pi$ is a log-concave probability measure on a
convex set $K$. Suppose further that $\pi$ has barycenter 0 and set $M_\pi = \int_K |x| \, \pi(dx)$.
Let $K = K_1 \cup K_2 \cup B$ be a partition of $K$ into three measurable sets such
that the distance between $K_1$ and $K_2$ is $d(K_1, K_2) > 0$. Then

$$\pi(B) \ge \frac{\ln 2}{M_\pi} \, d(K_1, K_2) \, \pi(K_1) \, \pi(K_2).$$

As remarked above, the proof of Theorem 3.1 extends to Theorem 3.2
via the "localization lemma" on log-concave probability measures [12],
Theorem 2.7.

The next lemma makes the connection between Euclidean distance be- 
tween two points and the total variation distance between the one-step 
Markov transition kernels starting from those two points. Both the idea 
and the proof are borrowed from [18]. 

Lemma 3.3. Let $K \subset \mathbb{R}^n$ be convex and suppose $u, v \in K$ satisfy
$|u - v| < \delta/(8\sqrt{n})$ for some $\delta > 0$. Suppose further that $P(x, dy)$ is a Metropolis-
Hastings transition kernel induced by a $\delta$-ball local proposal and having an
$\alpha$-smooth log-concave target distribution $\pi$ on $K$. Then

$$\|P(u, \cdot) - P(v, \cdot)\|_{\mathrm{tv}} \le 1 - \tfrac{1}{2} e^{-\alpha\delta}.$$

Proof. Let $B_\delta(u)$ and $B_\delta(v)$ be the balls of radius $\delta$ around $u$ and $v$,
respectively. Write $\mathrm{vol}(B_\delta)$ for their Euclidean volume and set $C = B_\delta(u) \cap
B_\delta(v)$. Since $|u - v| < \delta/(8\sqrt{n})$, we have $\mathrm{vol}(C) \ge \frac{1}{2} \mathrm{vol}(B_\delta)$. Since our target
distribution is an $\alpha$-smooth log-concave function, the Hastings ratio is of
the form

$$\frac{\pi(y)}{\pi(x)} \ge e^{-|V(x) - V(y)|} \ge e^{-\alpha |x - y|}.$$

Thus, for any point $x \in C$, the probability density for an accepted $\delta$-ball
move from $u$ to $x$ is at least $\frac{1}{\mathrm{vol}(B_\delta)} e^{-\alpha\delta}$; similarly for an accepted move
from $v$ to $x$. Thus, computing the total variation distance as 1 minus the
"overlapping area," we have

$$\|P(u, \cdot) - P(v, \cdot)\|_{\mathrm{tv}} \le 1 - \frac{1}{\mathrm{vol}(B_\delta)} \int_C e^{-\alpha\delta} \, \mu(dx) \le 1 - \tfrac{1}{2} e^{-\alpha\delta}.$$

□
Theorem 3.4. Suppose $\pi$ is an $\alpha$-smooth log-concave probability
measure on a convex set $K$. Suppose further that $\pi$ has barycenter 0 and set
$M_\pi = \int_K |x| \, \pi(dx)$. Then the conductance, $h_P$, of the Metropolis-Hastings
chain with transition kernel $P(x, dy)$ induced by the uniform $\delta$-ball proposal
satisfies

$$h_P \ge \frac{\delta e^{-\alpha\delta}}{1024 \sqrt{n} M_\pi},$$

provided $\delta$ is small compared to $\sqrt{n} M_\pi$.

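Proof. Let $K = S_1 \cup S_2$, where $S_1$ and $S_2$ are disjoint and measurable.
We begin by proving that

(17) $$\int_{S_1} P(x, S_2) \, \pi(dx) \ge \frac{\delta e^{-\alpha\delta}}{1024 \sqrt{n} M_\pi} \min(\pi(S_1), \pi(S_2)).$$

Now consider subsets that are "deep" inside $S_1$ and $S_2$, in the sense that
the Metropolis-Hastings chain is unlikely to move out of them in one step:

$$S_1' = \{x \in S_1 : P(x, S_2) < \tfrac{1}{4} e^{-\alpha\delta}\}$$

and

$$S_2' = \{x \in S_2 : P(x, S_1) < \tfrac{1}{4} e^{-\alpha\delta}\}.$$

First consider the case $\pi(S_1') < \pi(S_1)/2$. Then

$$\int_{S_1} P(x, S_2) \, \pi(dx) \ge \tfrac{1}{4} e^{-\alpha\delta} \pi(S_1 \setminus S_1') \ge \tfrac{1}{8} e^{-\alpha\delta} \pi(S_1),$$

which proves (17), provided we choose $\delta$ small enough compared to $\sqrt{n} M_\pi$.

So we can assume that $\pi(S_1') \ge \pi(S_1)/2$ and, by the same reasoning,
$\pi(S_2') \ge \pi(S_2)/2$. Then, for any $x \in S_1'$ and $y \in S_2'$,

$$\|P(x, \cdot) - P(y, \cdot)\|_{\mathrm{tv}} \ge |P(x, S_1) - P(y, S_1)|$$
$$\ge 1 - P(x, S_2) - P(y, S_1)$$
$$> 1 - \tfrac{1}{2} e^{-\alpha\delta}.$$

Applying Lemma 3.3, we obtain for any $x \in S_1'$ and $y \in S_2'$ that

$$|x - y| \ge \frac{\delta}{8\sqrt{n}},$$

and hence, $d(S_1', S_2') \ge \delta/(8\sqrt{n})$. Set $B = K \setminus \{S_1' \cup S_2'\}$ and apply Theorem 3.2
to the partition $K = S_1' \cup S_2' \cup B$ to get

$$\pi(B) \ge \frac{\ln 2}{M_\pi} \, d(S_1', S_2') \, \pi(S_1') \, \pi(S_2') \ge \frac{\delta \ln 2}{32 \sqrt{n} M_\pi} \, \pi(S_1) \, \pi(S_2).$$

From the above inequality and the simple fact that

$$\int_{S_1} P(x, S_2) \, \pi(dx) = \int_{S_2} P(x, S_1) \, \pi(dx),$$

we obtain

$$\int_{S_1} P(x, S_2) \, \pi(dx) = \frac{1}{2} \int_{S_1} P(x, S_2) \, \pi(dx) + \frac{1}{2} \int_{S_2} P(x, S_1) \, \pi(dx)$$
$$\ge \frac{1}{2} \int_{S_1 \cap B} P(x, S_2) \, \pi(dx) + \frac{1}{2} \int_{S_2 \cap B} P(x, S_1) \, \pi(dx)$$
$$\ge \frac{1}{8} e^{-\alpha\delta} \pi(B),$$

in agreement with (17) since $\pi(S_1)\pi(S_2) \ge \min(\pi(S_1), \pi(S_2))/2$ and $\ln 2 > 1/2$.

Thus, we have verified (17). To finish the proof of the theorem, just notice
that (17) implies, for every set $S_1$ satisfying $\pi(S_1) \le 1/2$ [and hence $\pi(S_2) \ge
1/2$], that

$$\int_{S_1} P(x, S_2) \, \pi(dx) \ge \frac{\delta e^{-\alpha\delta}}{1024 \sqrt{n} M_\pi} \, \pi(S_1)$$

and, hence,

$$h_P(S_1) = \frac{1}{\pi(S_1)} \int_{S_1} P(x, S_2) \, \pi(dx) \ge \frac{\delta e^{-\alpha\delta}}{1024 \sqrt{n} M_\pi},$$

$$h_P = \inf_{0 < \pi(A) \le 1/2} h_P(A) \ge \frac{\delta e^{-\alpha\delta}}{1024 \sqrt{n} M_\pi}. \qquad □$$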



Remark 2. We have freedom in choosing $\delta$. The optimal $\delta$ (for the lower
bound on conductance) is $\delta = 1/\alpha$. With this choice, we have

$$h_P \ge \frac{1}{1024 \, e \, \alpha \sqrt{n} M_\pi}.$$

This choice of $\delta$ makes sense. Imagine, for example, a chain starting at the
apex of a 1-dimensional two-sided exponential density $e^{-\alpha |x|}$, with $\alpha$ large.
A large value of $\delta$ causes proposed moves to be rejected most of the time,
resulting in slower mixing. However, a chain with small $\delta$ has a reasonably
large chance of moving away from the apex, and hence, mixes faster.
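
The optimality of $\delta = 1/\alpha$ can be seen from a one-line calculation. The display below assumes that the lower bound of Theorem 3.4 depends on $\delta$ only through the factor $\delta e^{-\alpha\delta}$ (with the constant $1024\sqrt{n} M_\pi$ in the denominator); the name $f$ is introduced here purely for illustration.

```latex
\[
f(\delta) = \frac{\delta e^{-\alpha\delta}}{1024\sqrt{n}\,M_\pi},
\qquad
f'(\delta) = \frac{(1-\alpha\delta)\,e^{-\alpha\delta}}{1024\sqrt{n}\,M_\pi} = 0
\iff \delta = \frac{1}{\alpha},
\qquad
f\!\left(\tfrac{1}{\alpha}\right) = \frac{1}{1024\,e\,\alpha\sqrt{n}\,M_\pi}.
\]
```

Since $f'$ is positive for $\delta < 1/\alpha$ and negative for $\delta > 1/\alpha$, the stationary point is a maximum of the lower bound.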

In recent work, Lovasz and Vempala [18] were able to demonstrate fast
convergence when sampling a log-concave distribution without the
"smoothness" assumption. The technique they used was, loosely, to "smooth out"
the distribution by convolving the log-concave density with a uniform
distribution of small variance. It is interesting to put their idea into a probability
context. Suppose $X$ and $Y$ are two random variables such that $X$ has a
log-concave density, $f(x)$. Suppose the probability density of $Y$ is smooth
and log-concave, with $E[Y] = 0$ and $\mathrm{Var}(Y)$ small. Then the sum of these
two random variables, $Z = X + Y$, has a density, $g(x)$, given by the
convolution of two log-concave densities, and hence, is also log-concave [15, 24].
Intuitively, these two densities $f(x)$ and $g(x)$ should be close to each other
if $\mathrm{Var}(Y)$ is sufficiently small, and $g(x)$ is smoother than $f(x)$ on the scale
of $\sqrt{\mathrm{Var}(Y)}$. $Y$ can be interpreted as a small perturbation and this
perturbation determines, in a way, how close a chain can get to the target
distribution (if one leaves out the smoothness assumption on density of $X$).
The result of [18] essentially says that

(18) $$\|\mu_0 P^n - \pi\| \le M\varepsilon + \gamma_\varepsilon^n \|\mu_0 - \pi\|,$$

where $\mu_0$ is the starting measure, $P$ is the Markov operator with target
measure $\pi$, $\varepsilon$ is a small term that determines the accuracy of the
algorithm, $M$ is a constant, and $\gamma_\varepsilon$ is the convergence rate that is determined
by $\varepsilon$. In fact, $\gamma_\varepsilon = 1 - h_\varepsilon^2/2$, where $h_\varepsilon$ is the $\varepsilon$-conductance defined by

$$h_\varepsilon = \inf_{\varepsilon < \pi(A) \le 1/2} \frac{\int_A P(x, A^c) \, \pi(dx)}{\pi(A) - \varepsilon}.$$

They were able to show that the $\varepsilon$-conductance
can be bounded below by a quadratic function of $\varepsilon$.

In summary, if one ignores sets of small measure for a log-concave target
density, a Metropolis-Hastings chain induced by a ball walk (even without
the smoothness assumption on the target) is "geometrically ergodic." We
would like to have directly applied this nice result, but we chose not to for
two reasons. First, the state decomposition theorem applies in the context of
spectral gap, while strictly speaking, equation (18) does not give geometric
ergodicity, and hence, it cannot be applied directly in the state
decomposition theorem. Second, if one chooses to cut off small sets, then all log-concave
densities that decay faster than an exponential essentially have compact
supports, and hence, are "smooth." So the results in this section apply. We note
here, however, that both Lemma 3.3 and Theorem 3.4 are borrowed from
[17] with some modifications to apply arguments on conductance instead of
$\varepsilon$-conductance.

4. Proof of the main theorem.

4.1. A 1-D example. To gain some insight into the role of the complexity
of the target distribution and the idea behind the proof of Theorem 1.3,
we begin with a simple 1-dimensional example in which $\Omega$ is a circle with
perimeter $4L$ for some $L \ge 1$; that is, the interval $[-2L, 2L]$ with the two
ends connected. Consider a two-mode target distribution

(19) $$\pi(x) = \begin{cases} c \nu e^{-\nu |x|}, & \text{if } x \in [-L, L], \\ c \nu e^{-\nu (2L - |x|)}, & \text{if } x \in [-2L, -L] \cup [L, 2L], \end{cases}$$

where $c$ is the normalization constant. Here, we can think of $L$ and $\nu$ as
determining the complexity of the target distribution; increasing $\nu$ makes
the modes more narrow, and increasing $L$ increases the size of the space
and places the modes further apart. We denote by $\pi_1$ the piece of $\pi$ defined
on $[-L, L]$ and by $\pi_2$ the other piece. We take for the local proposal the
uniform distribution $k(x, y) = 1/(2\delta)$ for $y \in [x - \delta, x + \delta]$ and 0 otherwise. Let
$P_k(x, dy)$ be the transition kernel for the Metropolis-Hastings chain based on
this local proposal and having target distribution $\pi$. Consider the partition
$A = [-L, L]$, $A^c = [-2L, -L] \cup [L, 2L]$. Then

$$h_{P_k} \le h_{P_k}(A) < 2c e^{-\nu(L - \delta)}.$$

By Cheeger's inequality, we get

(20) $$\mathrm{Gap}(P_k) \le 2 h_{P_k} < 4c e^{-\nu(L - \delta)}.$$

Thus, the spectral gap for the local Metropolis-Hastings chain decreases
exponentially in $L$ and $\nu$, finishing the first part of our proof for this example.

Now consider a heavy-tailed proposal distribution $h(x, y) = 1/(4L)$, that is,
a uniform distribution on $\Omega$, and the small-world proposal $g(x, y) = (1 -
s) k(x, y) + s h(x, y)$. Let $P_{g,A}(x, dy)$ be the transition kernel for the
small-world chain that is restricted to the set $A$. Then

$$P_{g,A}(x, dy) = (1 - s) P_{k,A}(x, dy) + s P_{h,A}(x, dy),$$

where $P_{k,A}$ and $P_{h,A}$ are the restrictions to $A$ of the kernels induced by
$k(x, y)$ and $h(x, y)$, respectively. By Lemma 1.2, we have $h_{P_{g,A}} \ge (1 - s) h_{P_{k,A}}$. It is
easy to check that, for the two-sided exponential distribution, $M_\pi = 1/\nu$.
Then by Theorem 3.4,

$$h_{P_{g,A}} \ge (1 - s) \frac{\delta \nu e^{-\nu\delta}}{1024}.$$

By Cheeger's inequality, we have

(21) $$\mathrm{Gap}(P_{g,A}) \ge \frac{h_{P_{g,A}}^2}{2} \ge \frac{\delta^2 \nu^2 e^{-2\nu\delta}}{2^{21}} (1 - s)^2.$$

By symmetry, the small-world chain that is restricted to $A^c$ has the same
lower bound for its spectral gap.

Also, by symmetry, the matrix of transition probabilities for the
component chain has the form $P_H = \begin{pmatrix} 1 - a & a \\ a & 1 - a \end{pmatrix}$. The spectral gap for this matrix
is $\mathrm{Gap}(P_H) = 2a$. Now we calculate $a = P_H(1, 2)$. Set $I = \int_0^L \nu e^{-\nu x} \, dx$. Then
$\pi(A) = 2cI$. By (12), we have

(22) $$P_H(1, 2) \ge \frac{s}{4IL} \int_0^L \int_0^L \nu e^{-\nu \max(x, y)} \, dx \, dy = \frac{s}{2 I \nu L} \left( 1 - e^{-\nu L} - \nu L e^{-\nu L} \right).$$

When $\nu L > 2$, this yields $P_H(1, 2) \ge s/(4\nu L)$. Note that instead of just using
the fact that $2\pi(A) = 1$, we chose to do the calculation the "hard" way in
order to show that the normalization constant $c$ has no effect on the spectral
gap.

Using the state decomposition theorem to combine (21) and (22), we have

(23) $$\mathrm{Gap}(P_g) \ge \frac{s (1 - s)^2 \nu \delta^2 e^{-2\nu\delta}}{2^{23} L} \quad \text{for } \nu L > 2.$$

Setting $\delta = 1/\nu$ in equation (23) leads to

$$\mathrm{Gap}(P_g) \ge \frac{s (1 - s)^2}{2^{23} e^2 \nu L} \quad \text{for } \nu L > 2.$$

For a small-world chain, the lower bound on the spectral gap decreases only
polynomially, like $1/(\nu L)$, in both $L$ and $\nu$. Moreover, the quantity $1/\nu$ determines the
absolute "size" of a mode, and hence, $1/(\nu L)$ reflects the relative size of
each mode. Thus, we can see how the spectral gap is influenced by the
relative size of each mode.

We have freedom in the choice of the value $s$. It is clear that $s = 0$
corresponds to a pure local chain and $s = 1$ corresponds to the rejection method.
Either case will make the right-hand side of (23) equal to 0, which either
implies the lower bound is too rough, or the chain is slowly mixing. Note that,
in the lower bound, the best value for $s$ is 1/3, which maximizes $s(1 - s)^2$.

Using a uniform distribution for $h(x, y)$ does not make sense in an
unbounded space. However, this is not a problem because we can always use,
say, a Cauchy distribution $h(x) = \frac{1}{\pi} \frac{b}{x^2 + b^2}$, where $b$ is the half width at half
maximum. Some prior knowledge about the target distribution will help in
choosing $b$ in a way that increases the lower bound on the spectral gap, and
hence, the convergence rate of the corresponding small-world chain. Even
in a bounded space, the use of a Cauchy distribution, instead of a uniform,
may increase the convergence rate in cases for which most of the mass is
accumulated in a small portion of the state space.
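
The contrast between the two chains in this 1-D example is easy to observe empirically. The sketch below simulates a two-mode exponential target on the circle $[-2L, 2L)$ and counts how often the chain moves between $A = [-L, L]$ and its complement. It is an illustration under assumptions: the function names, the unnormalized form of the log-density, and the parameter values ($L = 10$, $\nu = 2$, $\delta = 0.5$, $s = 0.1$) are chosen for the example and are not taken from the paper.

```python
import numpy as np

def log_target(x, L, nu):
    """Unnormalized log-density: exponential modes centred at 0 and at +-2L."""
    ax = abs(x)
    return -nu * ax if ax <= L else -nu * (2.0 * L - ax)

def wrap(x, L):
    """Map x back onto the circle [-2L, 2L)."""
    return (x + 2.0 * L) % (4.0 * L) - 2.0 * L

def count_mode_switches(L, nu, delta, s, n_steps, rng):
    """Run the chain from x = 0 and count moves between A = [-L, L] and A^c."""
    x, in_A, switches = 0.0, True, 0
    for _ in range(n_steps):
        if rng.uniform() < s:
            y = rng.uniform(-2.0 * L, 2.0 * L)            # long-range: uniform on the circle
        else:
            y = wrap(x + rng.uniform(-delta, delta), L)   # local delta-ball move
        # Symmetric proposal, so accept with probability min(pi(y)/pi(x), 1).
        if np.log(rng.uniform()) < log_target(y, L, nu) - log_target(x, L, nu):
            x = y
        now_in_A = bool(abs(x) <= L)
        switches += int(now_in_A != in_A)
        in_A = now_in_A
    return switches

rng = np.random.default_rng(2)
local_switches = count_mode_switches(10.0, 2.0, 0.5, 0.0, 20000, rng)   # s = 0: local chain
sw_switches = count_mode_switches(10.0, 2.0, 0.5, 0.1, 20000, rng)      # small-world chain
print("local chain switches:      ", local_switches)
print("small-world chain switches:", sw_switches)
```

With these values the local chain would need to cross a valley whose stationary mass is of order $e^{-\nu(L - \delta)}$, so it essentially never leaves its starting mode over the simulated horizon, whereas the small-world chain changes mode regularly.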
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4.2. The general case. 



Proof of Theorem 1.3. The proof of the general case is similar in
spirit to the one-dimensional case. For the first part of the theorem we want
to show that, under a local proposal, the spectral gap is exponentially small.
It is sufficient to prove that the one-step probability flow going out of at least
one mode is exponentially small. Among all $m$ pieces of the partition, at
least one piece has measure no bigger than 1/2. Without loss of generality,
suppose it is $A_1$. Consider any radius $L > 0$ such that $B = B_L(\beta_1) \subset A_1$,
where $\beta_1$ is the barycenter of $\pi_1$. Let $P_k$ be the operator induced by a local
proposal $k(x,y)$ given by a $\delta$-ball walk. Then



1 



vri(B) Jb 
1 



Pkix,B')n{dx) 

TT{x)k{x,y)^i{dy)fi{dx) 



7ri(S) Jb Jb 

7ri{B) Jl~5 
1 



< 



TTi{B)ui 

where the second inequality follows from the fact that
$\int_{B^c} k(x,y)\,\mu(dy) \le 1$, and we have written $\nu_1$ for the decay
exponent of $\pi_1$.
By Cheeger's inequality, we have 

$\mathrm{Gap}(P_k) \le 2h_{P_k} \le [\ldots],$

and this finishes the first part of the proof. 

To prove the second part of the theorem, let $P_s = (1-s)P_k + sP_h$ be the
small-world operator, where $P_k$ and $P_h$ are induced by the local proposal
$k(x,y)$ and the heavy-tailed proposal $h(x,y)$, respectively. Let $P_{s,A_j}$ be the
restriction of the operator $P_s$ to the set $A_j$, and let $P_{k,A_j}$, $P_{h,A_j}$ be the
restrictions of $P_k$, $P_h$ to $A_j$, respectively. We have
$P_{s,A_j} = (1-s)P_{k,A_j} + sP_{h,A_j}$.

By Theorem 3.4 and $M_{\pi_j} \le c/\nu_j$, we have

$h_{P_{s,A_j}} \ge [\ldots]\,(1-s),$

and hence, Cheeger's inequality implies

(24) $\mathrm{Gap}(P_{s,A_j}) \ge [\ldots]\,(1-s)^2.$

Next we want to calculate $P_H(i,j)$. Let $b = \max_{i \ne j} |\beta_i - \beta_j|$ denote the
maximum of the pairwise distances between barycenters. Let the
heavy-tailed distribution be an $n$-dimensional Cauchy distribution with half
width $b$:

^^^'^^ " C„(|y-x|2 + 62){n+l)/2' 

where c,„ = r(^^^)/7r("''~^)/^ is the normalization constant. 

On each partition piece $A_i$ pick a ball $B_i = B_{r_i}(\beta_i) \subset A_i$ such that
$\pi(B_i) = \frac{1}{2}\pi(A_i)$. Let $h_i = \inf_{x \in \partial B_i} \pi(x)$, the "height" of the
density $\pi_i$ along the boundary of $B_i$. Let $B_i^c = A_i \setminus B_i$ be the complement
of $B_i$ in the set $A_i$ and set $c_{ij} = \min(h_i/h_j, h_j/h_i)$. Then



1= / h{x,y)mm{7r{y),Tr{x))n{dx)fi{dy) 
JAi JAj 

> / /i(2;,y)min(7r(j/),7r(x))/i((ix)/n((iy) 



^ Ib Ib'^ y)7r(y) min 1^ n{dx)n{dy) 

>Cij / Tr{x)h{x,y)fi{dx)fj,{dy) 
Jb'? Jb, 

TT{y)h{x,y)fi{dy)^j,{dx). 



Since $h(x,y) = h(|x - y|) = h(r)$ decreases polynomially, while both $\pi(x)$ and
$\pi(y)$ decrease exponentially, there exists a ball with radius $wb$ such that
■Ki{B^) > |7rj(^j), TTjiBui) > ^TTjiAj), and inf^g^^ h{r) = e/cn, where e = 
e{wb) is polynomially small in Note that ^^{Bi) = |7rj(^j) and Hi^Bj) = 
l^j{Aj), so 



>Cij _ TT{x)—fi{dx)fi{dy) 
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(25) +Cij / _ / TT{y)—fi{dy)fi{dx) 

From (12) and (25) we get 

J^,Pg{x,Aj)TT{dx) 



Puihj) 



^ 27r( Ai)^ 

(26) 



For an $m \times m$ stochastic matrix $A = (a_{ij})$, the spectral gap can be bounded
from below [23] by

$\mathrm{Gap}(A) \ge m \min_{i,j} a_{ij}.$

Combining this with (26) results in 

(27) $\mathrm{Gap}(P_H) \ge \frac{[\ldots]}{12 c_n}\,\min_{i \ne j}\bigl(c_{ij}, \mathrm{vol}(B_j)\bigr).$

Using the state decomposition theorem to put (24) and (27) together, we 
get 

(28) $\mathrm{Gap}(P_s) \ge s(1-s)^2\,[\ldots]\,\min\bigl(c_{ij}, \mathrm{vol}(B_j)\bigr).$
Setting $\delta = 1/\max_j \nu_j$ yields

Notice that $\mathrm{vol}(B_j)$ decreases polynomially with an increase in $\nu_j$. This
concludes the proof. □
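
The matrix bound $\mathrm{Gap}(A) \ge m \min a_{ij}$ used in the proof can be sanity-checked on a small example. The sketch below is not from the paper: it takes the spectral gap to be one minus the largest modulus among the non-unit eigenvalues (an assumption about the convention), restricts to 3 x 3 matrices so that the two remaining eigenvalues follow from the trace and determinant, and uses an arbitrary symmetric matrix `A` chosen for illustration.

```python
import cmath

def gap_3x3(a):
    """Spectral gap 1 - max(|l2|, |l3|) of a 3x3 stochastic matrix."""
    trace = a[0][0] + a[1][1] + a[2][2]
    det = (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
           - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
           + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    # A stochastic matrix always has eigenvalue 1, so the other two
    # eigenvalues have sum (trace - 1) and product det.
    total = trace - 1.0
    root = cmath.sqrt(total * total - 4.0 * det)
    l2, l3 = (total + root) / 2.0, (total - root) / 2.0
    return 1.0 - max(abs(l2), abs(l3))

# Illustrative symmetric (hence reversible) stochastic matrix.
A = [[0.5, 0.3, 0.2],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
gap = gap_3x3(A)                               # eigenvalues are 1, 0.3, 0.1
bound = len(A) * min(min(row) for row in A)    # m * min a_ij = 3 * 0.2
```

For this matrix the gap is 0.7 and the bound is 0.6, so the inequality holds with some room to spare.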

Remark 3. In the proof we essentially used a uniform distribution on a
bounded set as a heavy-tailed distribution. Notice that, loosely,
$\varepsilon\,\mathrm{vol}(B_j)/c_n$ determines the relative size of mode $j$. In our lower bound
as shown in (28), we have the so-called "curse of dimensionality": $c_n$
increases exponentially with the dimension $n$. Interestingly, the best value
for $s$ in the lower bound is still 1/3.
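
For readers who want to experiment with the $n$-dimensional Cauchy proposal used in the proof, a jump can be drawn without ever evaluating the normalization constant. The sketch below relies on the standard representation of a multivariate Cauchy vector as a standard Gaussian vector divided by the absolute value of an independent standard Gaussian scalar; the function name `cauchy_jump`, the half width `b = 3.0` and the sample size are illustrative assumptions.

```python
import random

def cauchy_jump(x, b, rng):
    """Propose y = x + b * Z / |W| with Z ~ N(0, I_n) and W ~ N(0, 1)."""
    w = abs(rng.gauss(0.0, 1.0))
    while w == 0.0:  # guard against division by zero
        w = abs(rng.gauss(0.0, 1.0))
    # The same scalar w divides every coordinate of the Gaussian vector.
    return [xi + b * rng.gauss(0.0, 1.0) / w for xi in x]

rng = random.Random(1)
b = 3.0
# In one dimension the step is Cauchy with half width b, so the
# absolute step size has median b.
steps = sorted(abs(cauchy_jump([0.0], b, rng)[0]) for _ in range(20001))
median_abs_step = steps[len(steps) // 2]
```

The sample median of the absolute step should come out close to `b`, which gives a simple check that the half width is being applied as intended.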
5. Metropolis-coupled MCMC and simulated tempering. Metropolis-coupled
Markov chain Monte Carlo (MCMCMC), proposed by Geyer [6], is in
the same spirit as "simulated tempering," which was independently proposed
by Marinari and Parisi [20]. Both are based on an analogy with simulated
annealing [13], which is an optimization algorithm rather than a sampling
scheme. It provides the useful metaphor of using some help from a "heated"
version of the problem (that makes valley crossing easier by flattening the
state space) to obtain the result in the original "cooled" version of the
problem one is interested in. Simulated annealing uses a specific form of
"heating" that is sometimes called "powering up." If $h_1(x)$ is the
unnormalized density for the distribution of interest, $h_t(x) = h_1(x)^{1/t}$,
for $t > 1$, are the heated unnormalized densities, including perhaps
$t = \infty$, which gives $h_\infty(x) = 1$. However, as noted by [7], "powering
up" is not an essential part of simulated tempering or of MCMCMC, and a
different form of heating may work better in a specific real application.
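
The flattening effect of powering up can be seen numerically. The sketch below is an illustration under assumed settings, not part of the paper: the two-mode density `h1` (peaks at -10 and 10, decay rate 2) and the helper names are hypothetical. It compares the density in the valley between the modes to the density at a peak for several temperatures.

```python
import math

def h1(x, nu=2.0, centers=(-10.0, 10.0)):
    """Assumed unnormalized cold density with two exponential modes."""
    return sum(math.exp(-nu * abs(x - c)) for c in centers)

def heated(x, t):
    """Powered-up density h_t(x) = h_1(x) ** (1/t); t = inf is flat."""
    return 1.0 if math.isinf(t) else h1(x) ** (1.0 / t)

def valley_to_peak(t):
    """Ratio of the heated density at the valley (x = 0) to a peak (x = 10)."""
    return heated(0.0, t) / heated(10.0, t)

ratios = {t: valley_to_peak(t) for t in (1.0, 10.0, math.inf)}
```

At `t = 1` the valley is many orders of magnitude below the peak, at `t = 10` the ratio is of order one tenth, and at `t = inf` the landscape is completely flat.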

Let $T = \{1, \ldots, t\}$. Both MCMCMC and simulated tempering simulate a
sequence of distributions specified by unnormalized densities $h_i(x)$ ($i \in T$)
on the same sample space $\Omega$, where the index $i$ is called the "temperature,"
$h_1(x)$ is the "cold" distribution, and $h_t(x)$ is the "hot" distribution. Thus, an
MCMCMC chain lives in a product state space $\Omega \times T$ such that, for a given
$i \in T$, the chain updates itself on $\Omega$ using a Metropolis-Hastings algorithm.
For the move between different "temperatures," one keeps the $x \in \Omega$ and
only updates the "temperature." Specifically, suppose $a(i)$ ($i = 1, \ldots, t$) is the
auxiliary probability distribution for the temperatures. Then one iteration
of the "Metropolis-Hastings" version of the simulated tempering algorithm
is as follows [7]:

1. Update $x$ using a Metropolis-Hastings update for $h_i$.

2. Set $j = i \pm 1$ according to probabilities $q_{i,j}$, where $q_{1,2} = q_{t,t-1} = 1$ and
$q_{i,i+1} = q_{i,i-1} = 1/2$ if $1 < i < t$ (i.e., a reflecting random walk on the
temperatures).

3. Calculate the Hastings ratio

$r = \frac{h_j(x)\,a(j)\,q_{j,i}}{h_i(x)\,a(i)\,q_{i,j}}$

and accept the transition (set $i$ to $j$) or reject it according to the
Metropolis rule: accept with probability $\min(r, 1)$.
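
Steps 1 to 3 can be sketched in code as follows. This is a minimal illustration, not the implementation of [7]: the two-mode target, the temperature ladder `temps`, the uniform auxiliary distribution `a`, the local uniform proposal of radius `delta`, and all function names are assumptions. The heated densities are powered-up versions of the cold density, and indices run from 0 (cold) rather than 1.

```python
import math
import random

def log_h(x, t, nu=2.0, centers=(-10.0, 10.0)):
    """Log of the powered-up density h_1(x) ** (1/t), via log-sum-exp."""
    exps = [-nu * abs(x - c) for c in centers]
    top = max(exps)
    return (top + math.log(sum(math.exp(e - top) for e in exps))) / t

def q(i, j, m):
    """Reflecting random walk on temperature indices 0, ..., m - 1."""
    if abs(i - j) != 1:
        return 0.0
    return 1.0 if i in (0, m - 1) else 0.5

def log_uniform(rng):
    """log(U) with U uniform on (0, 1], which avoids log(0)."""
    return math.log(1.0 - rng.random())

def tempering_step(x, i, temps, a, rng, delta=1.0):
    m = len(temps)
    # Step 1: Metropolis update of x for h_i with a local uniform proposal.
    y = x + rng.uniform(-delta, delta)
    if log_uniform(rng) < log_h(y, temps[i]) - log_h(x, temps[i]):
        x = y
    # Step 2: propose a neighbouring temperature index j = i +/- 1.
    if i == 0:
        j = 1
    elif i == m - 1:
        j = m - 2
    else:
        j = i + rng.choice((-1, 1))
    # Step 3: Hastings ratio r, accepted with probability min(r, 1).
    log_r = (log_h(x, temps[j]) + math.log(a[j] * q(j, i, m))
             - log_h(x, temps[i]) - math.log(a[i] * q(i, j, m)))
    if log_uniform(rng) < log_r:
        i = j
    return x, i

rng = random.Random(0)
temps = (1.0, 4.0, 16.0)            # index 0 is the cold distribution
a = (1.0 / 3, 1.0 / 3, 1.0 / 3)     # assumed uniform auxiliary distribution
x, i = -10.0, 0
cold_samples = []
for _ in range(100_000):
    x, i = tempering_step(x, i, temps, a, rng)
    if i == 0:
        cold_samples.append(x)      # keep only samples at the cold level
```

Only the states recorded at the cold level are samples from the distribution of interest. With a uniform `a` the chain spends most of its time at the hotter levels, which is one reason the auxiliary distribution is usually tuned in practice.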

An implicit assumption in the simulated tempering algorithm is that, at
each temperature, the proposal distribution that is used to generate a new
move $x \in \Omega$ is local. For the sake of simplicity and clarity, let us assume
that we have two temperatures, hot and cool, $a(1) = a(2) = 1/2$ and
$q_{1,2} = q_{2,1} = 1$. Then $r$ in step 3 becomes $h_j(x)/h_i(x)$, for $i,j \in \{1,2\}$.
Suppose now that the chain is at high temperature, $h_2(x)$. If $x$ is in a mode,
then $h_1(x)/h_2(x)$ is close to 1 (by powering up), so that the chain is likely to
jump back to the cool state and collect samples. On the other hand, if $x$ is
in a valley, $h_1(x)/h_2(x)$ is small, so that the chain tends to stay at the hot
temperature. When the hot chain has wandered far enough and proposes a
move back to the cool temperature, it in fact proposes a move to the cool chain
that is on average far away (as compared to the local proposal) from the
state (in $\Omega$) where the chain last visited the cool temperature. In summary,
if one is only interested in the samples collected in the cool state (i.e., the
original distribution), then the only purpose of the hot state is to provide a
far away proposal for the cool chain. This is the exact spirit of the occasional
heavy-tailed proposals in the small-world chain.

We note, however, that although simulated tempering, or MCMCMC, is
a way to generate heavy-tailed proposals to overcome bottlenecks in $\Omega$, the
computational cost is heavy, much heavier than for a small-world chain.
Moreover, it has been shown by Bhatnagar and Randall [2] that, in certain
situations, the transition between different temperatures can have
bottlenecks, which will reduce the frequency of "heavy-tailed" proposals, and
hence, slow down the overall convergence.

Nonetheless, if one can rule out the possible bottlenecks in transitions
between the hot chain and the cool chain, our Theorem 1.3 for small-world
chains readily applies to MCMCMC, or simulated tempering, to show that
both of them are "rapidly mixing."

Note that the different temperatures in simulated tempering in fact
correspond to different amounts of heaviness of the tail in a small-world
chain. In particular, when $\Omega$ is compact, $t = \infty$ corresponds to the
heavy-tailed proposal being a uniform distribution. Therefore, we propose
that a promising scheme for using Markov chain Monte Carlo methods to solve
hard problems would be to run multiple small-world chains in parallel, with
different chains having different heaviness of tails (for example, using
different half-widths in Cauchy distributions), and then to couple the chains
via the Hastings ratio and Metropolis rule.